feat: Add str.normalize()
#20483
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20483 +/- ##
==========================================
- Coverage 78.96% 78.85% -0.12%
==========================================
Files 1557 1559 +2
Lines 220743 221124 +381
Branches 2527 2527
==========================================
+ Hits 174318 174363 +45
- Misses 45847 46183 +336
Partials 578 578
This kernel should not be written by collecting to a temporary; reuse one buffer across values instead:
use unicode_normalization::UnicodeNormalization; // provides nfc(), nfkc(), nfd(), nfkd() on &str

pub fn normalize_with<F: Fn(&str, &mut String)>(ca: &StringChunked, normalizer: F) -> StringChunked {
    // One scratch buffer for the whole column; clear() keeps its capacity.
    let mut buffer = String::new();
    let mut builder = StringChunkedBuilder::new(ca.name().clone(), ca.len());
    for opt_s in ca.iter() {
        if let Some(s) = opt_s {
            buffer.clear();
            normalizer(s, &mut buffer);
            builder.append_value(&buffer);
        } else {
            // Nulls pass through unchanged.
            builder.append_null();
        }
    }
    builder.finish()
}

pub fn normalize(ca: &StringChunked, form: UnicodeForm) -> StringChunked {
    // Match once per column so each form gets its own specialized loop.
    match form {
        UnicodeForm::NFC => normalize_with(ca, |s, b| b.extend(s.nfc())),
        UnicodeForm::NFKC => normalize_with(ca, |s, b| b.extend(s.nfkc())),
        UnicodeForm::NFD => normalize_with(ca, |s, b| b.extend(s.nfd())),
        UnicodeForm::NFKD => normalize_with(ca, |s, b| b.extend(s.nfkd())),
    }
}
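For anyone who wants to try the buffer-reuse pattern outside the polars codebase, here is a minimal std-only sketch. Everything in it is a hypothetical stand-in, not polars API: `FlatStrings` plays the role of the string builder, `Case` plays the role of `UnicodeForm`, and case conversion replaces Unicode normalization so that no external crate is needed.

```rust
/// Hypothetical stand-in for `UnicodeForm`: std-only transforms, no dependencies.
#[derive(Clone, Copy)]
pub enum Case {
    Upper,
    Lower,
}

/// Hypothetical stand-in for a string column builder: values are stored
/// back to back in one allocation, with offsets and a validity flag per row.
pub struct FlatStrings {
    data: String,
    offsets: Vec<usize>,
    valid: Vec<bool>,
}

impl FlatStrings {
    fn with_capacity(rows: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        Self { data: String::new(), offsets, valid: Vec::with_capacity(rows) }
    }

    fn push(&mut self, value: Option<&str>) {
        if let Some(s) = value {
            self.data.push_str(s);
        }
        self.offsets.push(self.data.len());
        self.valid.push(value.is_some());
    }

    pub fn len(&self) -> usize {
        self.valid.len()
    }

    /// Returns the value at `row`, or `None` if that row is null.
    pub fn get(&self, row: usize) -> Option<&str> {
        if self.valid[row] {
            Some(&self.data[self.offsets[row]..self.offsets[row + 1]])
        } else {
            None
        }
    }
}

/// Applies `transform` to every non-null row through a single scratch buffer.
pub fn transform_with<F: Fn(&str, &mut String)>(rows: &[Option<&str>], transform: F) -> FlatStrings {
    let mut buffer = String::new();
    let mut out = FlatStrings::with_capacity(rows.len());
    for &row in rows {
        match row {
            Some(s) => {
                buffer.clear(); // drops the contents, keeps the capacity
                transform(s, &mut buffer);
                out.push(Some(buffer.as_str()));
            }
            None => out.push(None),
        }
    }
    out
}

/// Dispatches once per column, so each arm gets its own specialized loop.
pub fn change_case(rows: &[Option<&str>], case: Case) -> FlatStrings {
    match case {
        Case::Upper => transform_with(rows, |s, b| b.extend(s.chars().flat_map(|c| c.to_uppercase()))),
        Case::Lower => transform_with(rows, |s, b| b.extend(s.chars().flat_map(|c| c.to_lowercase()))),
    }
}
```

Calling `change_case(&[Some("abc"), None], Case::Upper)` gives a two-row result whose second row is still null. The scratch buffer is allocated once and only grows when a row needs more room than any earlier row did.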
Thanks @orlp, I naively followed
Updated benchmark:
(Can't really explain the change in magnitude compared to the first one but the gap between polars and pandas now is consistently there)
Thanks for your first contributions @etiennebacher. Before implementing features, we should first decide if we want them. (This is shown by the accepted tag). For one, I am not entirely sure that we do want this in the main library. It seems quite a large dependency (with all the unicode tables), which might be better suited for a plugin. Let me get back to this, I want to see how much this dependency adds and how important of a feature this is.
Sure, no problem with letting this be a plugin functionality. I don't mind this being closed, but no matter the outcome the two issues mentioned in the original post should be updated.
This functionality is already in the polars-ds extension:
cb023f8 to 8497bb9
Alright, I have looked at the wheel size, and at how core this is and I think this is worth it.
The PR looks great @etiennebacher. Thanks. 👍
Ah, I see there is 1 mypy lint. Can that be fixed?
… into str_normalize
Thanks @ritchie46, the mypy failure is fixed.
Contributing to the Rust part for the first time so there are probably some quirks here and there. I used the suggestion in #11455 to use the unicode_normalization crate and mostly followed #12878. I don't know if you want to add this function or to implement it that way but it was good training for me anyway.
Note that I'm not very familiar with this method so double-checking the output and maybe adding more corner cases to the test suite would be nice.
Quick performance check after make build-release:
A bit disappointed with the performance, maybe I missed something obvious. There are also a couple of issues on performance in the Rust crate used: https://github.com/unicode-rs/unicode-normalization/issues?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen+performance
Fixes #5799
Fixes #11455